Effective encoder-decoder neural network for segmentation of orbital tissue in computed tomography images of Graves’ orbitopathy patients

Purpose To propose a neural network (NN) that can effectively segment orbital tissue in computed tomography (CT) images of Graves’ orbitopathy (GO) patients. Methods We analyzed orbital CT scans from 701 GO patients diagnosed between 2010 and 2019 and devised an effective NN specializing in semantic orbital tissue segmentation in GO patients’ CT images. After training four conventional NNs (Attention U-Net, DeepLab V3+, SegNet, and HarDNet-MSEG) and the proposed NN on the manual orbital tissue segmentations, we calculated the Dice coefficient and Intersection over Union for comparison. Results CT images of the eyeball, four rectus muscles, the optic nerve, and the lacrimal gland tissues from all 701 patients were analyzed in this study. In the axial image with the largest eyeball area, the proposed NN achieved the best performance, with Dice coefficients of 98.2% for the eyeball, 94.1% for the optic nerve, 93.0% for the medial rectus muscle, and 91.1% for the lateral rectus muscle. The proposed NN also gave the best performance for the coronal image. Our qualitative analysis demonstrated that the proposed NN outputs provided more sophisticated orbital tissue segmentations for GO patients than the conventional NNs. Conclusion We concluded that our proposed NN exhibited improved CT image segmentation for GO patients over conventional NNs designed for semantic segmentation tasks.


Introduction
The orbit of the eye is a complex anatomical region, which can be evaluated by a variety of radiological modalities. Computed tomography (CT) is the most widely used imaging modality for diagnosing orbital pathologies. Moreover, recent technical advances in the three-dimensional analysis of CT images have enabled the quantitative measurement of extraocular muscles, lacrimal glands, and orbital fat [1]. For example, Graves' orbitopathy (GO) is a well-known orbital disease with extrathyroidal features of dysthyroidism [2]. Using CT to quantitatively analyze orbital tissue has become essential for the clinical assessment of GO activity or severity [3]. Currently, however, the manual segmentation of orbital tissues from CT images remains a labor-intensive and time-consuming process and can be observer-dependent [4,5].
Previous studies have used neural networks (NNs) to identify, discriminate, and grade various diseases [6][7][8]. In ophthalmology, NNs are used in the grading of diabetic retinopathy [9], age-related macular degeneration [10], and glaucoma screening [11]. Semantic segmentation of the orbital area using NNs can be a successful application of machine learning techniques because tissues with different Hounsfield units are clustered within a narrow orbit. Recently, several attempts have been made to segment orbital bones, orbital fat, or eye globes in both orbital CT and magnetic resonance imaging using NNs [12][13][14]. However, due to the architecture of conventional NNs, the restoration performance of the decoder can be limited, which can result in output segments with rough boundaries, particularly when the analyzed CT images contain deformed orbital tissue.
In this study, we propose an effective NN based on encoder-decoder architecture to improve tissue segmentation quality in GO patients. To validate the superiority of the proposed NN, we compare the performance of our proposed NN against four conventional NNs: Attention U-Net [15,16], DeepLab V3+ [17], SegNet [18], and HarDNet-MSEG [19]. The five NNs are applied to orbital CT images to segment the eyeball, optic nerve, lacrimal gland, and extraocular rectus muscles. We also conduct an in-depth analysis by comparing the segmentation results given by the proposed NN and conventional NNs.

Materials and methods
This study is a retrospective comparative effectiveness research study. The protocol was approved by the institutional review board of the Chung-Ang University Hospital (IRB No. 2112-013-19395) and complies with the tenets of the Declaration of Helsinki. The requirement for informed consent was waived by the institutional review board because of the retrospective nature of the study.

Participants
We obtained the orbital CT images (Philips Brilliance 256 Slice iCT, Philips Healthcare Systems, Andover, MA, USA) from 701 GO patients diagnosed between January 2010 and October 2019. Continuous axial scanning was performed with the patient's head positioned parallel to the Frankfort plane while looking at a fixed point. The scanning parameters were 120 kV, 150 mAs, 64 x 0.625 mm detector configuration, 1.0 mm slice thickness, and 1.0 mm slice increment. Patients with orbital tumors, orbital bone fractures, other orbital structural deformities, or ocular muscle surgery histories were excluded from the study.

Image acquisition and manual segmentation
Two axial slices and one coronal slice from the respective CT images were selected for each subject. The axial slice displaying the largest eyeball area was selected and designated as Axial 1; the axial slice displaying the largest lacrimal gland area was selected and designated as Axial 2. The coronal slice located halfway between the eyeball-optic nerve junction and the inner exit of the optic nerve within the orbit was selected and designated as Coronal.
The boundaries of the eyeball, optic nerve, medial rectus muscle, and lateral rectus muscle were outlined in the Axial 1 image, and the boundary of the lacrimal gland was outlined in the Axial 2 image. The boundaries of the optic nerve and the medial, lateral, superior, and inferior rectus muscles were outlined in the Coronal image (Fig 1). Manual segmentation was performed by a single observer using ImageJ software ver. 1.46 (National Institutes of Health, Bethesda, MD, USA; http://rsbweb.nih.gov/ij/).

Proposed and conventional neural networks
The Fully Convolutional Network (FCN) is a base NN whose architecture has influenced many later NNs for semantic segmentation [20], as FCN variants share a similar architecture with some modifications. For example, U-Net [21], one of the most successful NNs in medical image segmentation, is a modified version of an FCN that strengthens the symmetry of the encoder-decoder architecture. Motivated by this point, we devised our NN for semantic orbital tissue segmentation in GO patients' CT images based on the encoder-decoder architecture (Fig 2).
The goal of the five-block encoder is to extract features that contain important information, such as orbital tissue location and boundary. The first two blocks contain two convolution layers each, and the last three blocks contain three convolution layers each. Each convolution layer sequentially performs convolution, batch normalization, and rectified linear unit operations. After the convolution layers of each block, one max pooling operation filters out unnecessary information. Because of the convolution and max pooling operations, the image size is reduced gradually as the original image passes through all five blocks.
The features extracted by each block cover a different number of pixels, or area, of the image. For example, the features extracted from the first block cover a small area of the original image; hence, these lowest-level features can help capture the detailed boundary of orbital tissues with complicated shapes, such as the optic nerve. In contrast, the highest-level features extracted from the last block cover a large area of the original image. They can locate the orbital tissue in the image and resist the local noise caused by any deformities. Therefore, low-level features can achieve high-quality tissue segmentation and refine the segmentation boundary by starting from where the highest-level features indicate. In addition, since each block extracts features from the reduced image, the decoder can be designed with counterpart blocks against each encoder block to ensure the compatibility of feature coverage.
The five-block decoder has an architecture symmetrical to that of the encoder, but the max pooling layer of the encoder is replaced with an un-pooling layer for restoring the reduced image to its original size [18]. Initially, the decoder applies an un-pooling operation to the highest-level features delivered from the last block of the encoder. Because the un-pooling operation does not require any additional parameters to be trained, more refined values for the other trainable parameters can be obtained within a limited number of training epochs. Based on the max pooling indexes used in the pooling layer of the counterpart block in the encoder, the image size can be restored by copying each feature value to its corresponding location and padding the remaining locations with zeros. Next, the features extracted from the counterpart blocks of the encoder are delivered through the skip connection to refine the rough boundary generated through the un-pooling process or alleviate the loss of the original boundary information during the encoding process [21]. These processes are repeated five times until the input image is restored to its original size. As a result, the proposed NN maintains multi-level information to improve the segmentation quality of the orbital tissue of GO patients. The source code of the proposed NN is available at https://github.com/tkdgur658/OTSNet.
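The index-based un-pooling step described above can be illustrated with a minimal sketch. It is written in plain Python on a single 4 x 4 feature map and assumes a 2 x 2 pooling window with stride 2, which the text does not state; the function names are hypothetical and are not taken from the authors' repository.

```python
def max_pool_2x2(x):
    """2x2 max pooling (stride 2) that also records where each maximum came from."""
    pooled, indices = [], []
    for i in range(0, len(x), 2):
        pooled_row, index_row = [], []
        for j in range(0, len(x[0]), 2):
            # Collect the four values of the window together with their positions.
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in (0, 1) for dj in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, out_h, out_w):
    """Copy each pooled value back to its recorded position; all other cells stay zero."""
    restored = [[0.0] * out_w for _ in range(out_h)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (i, j) in zip(pooled_row, index_row):
            restored[i][j] = value
    return restored


feature = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [6.0, 2.0, 3.0, 1.0],
]
pooled, indices = max_pool_2x2(feature)           # pooled == [[4.0, 5.0], [6.0, 7.0]]
restored = max_unpool_2x2(pooled, indices, 4, 4)  # sparse 4x4 map, no trained weights
```

The restored map is sparse because only the positions of the maxima are filled in, which is why the skip connection features are needed afterwards to refine the boundary.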
This study used four conventional NNs for semantic segmentation to verify the proposed NN's performance: Attention U-Net, DeepLab V3+, SegNet, and HarDNet-MSEG. Attention U-Net, DeepLab V3+, and SegNet are the medical field's most widely used FCN variants, and HarDNet-MSEG is a medical image segmentation model equipped with the latest NN design techniques. Attention U-Net is a recently improved U-Net variant [22][23][24] and the most widely used model for medical image segmentation tasks, including orbital structure segmentation. DeepLab V3+, the latest version in the DeepLab series [17,25,26], is frequently used for medical image segmentation because of the effectiveness of atrous convolution [27][28][29][30]. DeepLab V3+ delivers fewer extracted feature types from the encoder than the proposed NN, resulting in a rough segmentation boundary from information loss. Although SegNet was initially developed for road scene images, many variants have been validated for medical imaging [31][32][33][34]. We used the parameter settings for SegNet suggested by Chandra et al. [34]. In contrast to the proposed NN, SegNet does not exploit additional encoder information except for max pooling indexes during the decoding process. HarDNet-MSEG uses HarDNet as its backbone network and incorporates both receptive field blocks and cascaded partial decoders for segmentation tasks. HarDNet-MSEG demonstrated promising medical image segmentation performance owing to the receptive field block [35,36], which is why it was chosen as a comparison method in this study. However, unlike the proposed NN, HarDNet-MSEG does not pass low-level feature maps to the decoder.
Fig 3 illustrates the flowchart of the overall training and test process. The size of the CT images was 512 x 512, and the output was the same size. Before preprocessing, each integer pixel value ranged from -1024 to 3071. The pixel values were normalized to the range 0 to 1 through a VOI LUT operation with a window center of 0 and a window width of 200.
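The windowing step can be sketched as follows. This is a simplified linear window, assuming that values below center - width/2 map to 0 and values above center + width/2 map to 1; the exact VOI LUT function used in the study is not specified, and `apply_window` is a hypothetical helper name.

```python
def apply_window(pixel_value, center=0.0, width=200.0):
    """Map a raw CT pixel value to [0, 1] with a simplified linear window.

    Assumption: plain linear scaling between (center - width/2) and
    (center + width/2), clipped outside that range.
    """
    lower = center - width / 2.0
    scaled = (pixel_value - lower) / width
    return min(1.0, max(0.0, scaled))


# Values spanning the raw range reported in the text (-1024 to 3071).
for value in (-1024, -100, -50, 0, 100, 3071):
    print(value, apply_window(value))
```

With a center of 0 and a width of 200, only values between -100 and 100 keep their contrast; everything outside that range saturates at 0 or 1.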
The predicted output value was transformed to 0 for the background and 1 for the orbital tissue by applying the sigmoid function and a threshold of 0.5. We used Tversky Focal Loss and conducted the tests using the weights that gave the minimum validation loss over 50 epochs. We implemented the five NNs with the PyTorch (1.10.1) library and conducted all experiments in a GeForce RTX 3090 24 GB environment. The hyperparameters comprised the batch size, optimizer, learning rate, and weight decay, which were set to 16, AdamW, 1e-3, and 1e-4, respectively.
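For illustration, the output thresholding and one common formulation of a focal Tversky loss can be sketched in plain Python. The values of `alpha`, `beta`, and `gamma` below are placeholders chosen for the example, since the text does not report the loss settings, and the function names are hypothetical. The sketch also assumes that a probability exactly equal to the threshold counts as tissue.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw network output to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def binarize(logits, threshold=0.5):
    """Label a pixel 1 (orbital tissue) if its probability reaches the threshold, else 0."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def focal_tversky_loss(probs, targets, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-7):
    """One common focal Tversky formulation: (1 - TI) ** gamma.

    TI = TP / (TP + alpha * FN + beta * FP), computed on soft predictions.
    alpha, beta, and gamma are illustrative assumptions, not the study's settings.
    """
    tp = sum(p * t for p, t in zip(probs, targets))
    fn = sum((1.0 - p) * t for p, t in zip(probs, targets))
    fp = sum(p * (1.0 - t) for p, t in zip(probs, targets))
    tversky_index = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1.0 - tversky_index) ** gamma


labels = binarize([-2.0, 1.5, 3.0])                         # [0, 1, 1]
perfect = focal_tversky_loss([1.0, 0.0, 1.0], [1, 0, 1])    # no error, loss is 0
worst = focal_tversky_loss([0.0, 1.0], [1, 0])              # every pixel wrong, loss near 1
```

Setting `alpha` above `beta` penalizes missed tissue pixels more than false alarms, which is one reason this family of losses is often chosen for small structures.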

Evaluation
Manually segmented images were used as the ground truth for training the five NNs and for evaluating their performance. For training and evaluation, we randomly split the 701 datasets into training, validation, and test sets with a ratio of 0.7 (training) to 0.15 (validation) to 0.15 (test). Training and testing were repeated ten times for statistical calculations. Overall segmentation performance was evaluated using the Dice coefficient and Intersection over Union (IoU).
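The two evaluation criteria can be written out explicitly. For a predicted mask P and a ground truth mask G, the Dice coefficient is 2|P ∩ G| / (|P| + |G|) and the IoU is |P ∩ G| / |P ∪ G|. The sketch below computes both on small binary masks; the helper name and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1."""
    pred_flat = [p for row in pred for p in row]
    truth_flat = [t for row in truth for t in row]
    intersection = sum(1 for p, t in zip(pred_flat, truth_flat) if p == 1 and t == 1)
    pred_area = sum(pred_flat)
    truth_area = sum(truth_flat)
    union = pred_area + truth_area - intersection
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou


predicted = [[1, 1],
             [1, 0]]
ground_truth = [[1, 1],
                [0, 0]]
dice, iou = dice_and_iou(predicted, ground_truth)  # dice = 4/5, iou = 2/3
```

The two measures are monotonically related (Dice = 2 IoU / (1 + IoU)), so they rank methods the same way on a single image but can differ once averaged over a test set.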

Statistical analysis
The Dice coefficient and IoU were compared between the five NNs. The values are presented as means with standard deviations. For each criterion (Dice coefficient and IoU), the difference between the NN with the highest segmentation performance and each of the four remaining NNs was statistically analyzed using a paired t-test. All statistical analyses were performed using the Python library SciPy (https://www.scipy.org), with a p-value of < 0.001 denoting statistical significance.
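As a worked illustration of the paired comparison, the t statistic is the mean of the per-repetition differences divided by its standard error, with n - 1 degrees of freedom. The sketch below uses only the Python standard library and placeholder scores that are not results from this study; in SciPy, `scipy.stats.ttest_rel` computes the same statistic together with the p-value.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical Dice coefficients from repeated runs (placeholder values only).
best_nn = [0.90, 0.91, 0.92]
other_nn = [0.88, 0.90, 0.89]
t_stat, dof = paired_t_statistic(best_nn, other_nn)
print(round(t_stat, 3), dof)
```

Pairing is appropriate here because each repetition evaluates both networks on the same random split, so the per-split differences remove the variation between splits.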

Results
The age and sex of the 701 participants are shown in Table 1. The number of female participants (503, 71.8%) was much higher than that of male participants.
Comparisons of the performance of the five NNs, based on the Dice coefficient and IoU for each of the three images, are shown in Tables 2 and 3. In the Axial 1 and 2 images, all the Dice coefficients and IoU values for the segmentation of the five target tissues, including the eyeball, optic nerve, medial rectus muscle, lateral rectus muscle, and lacrimal gland, were highest in the proposed NN (Table 2). The Dice coefficients and IoU values for the five tissues were lowest in HarDNet-MSEG. In the Coronal image, the Dice coefficient and IoU value for the segmentation of the optic nerve were highest in the proposed NN, followed by SegNet (Table 3). The Dice coefficients and IoU values for the segmentation of the four rectus muscles were also highest in the proposed NN, followed by Attention U-Net (Table 3).
We conducted a qualitative analysis to demonstrate the superiority of our proposed NN over conventional NNs. We chose SegNet as the counterpart of the proposed NN because of its effectiveness and simple architecture. Fig 4 shows the comparison results between the proposed NN and SegNet. The figure's first column shows the input images with the target orbital tissue and the ground truth boundary. The second and third columns show the segmentation results of SegNet and the proposed NN, respectively, as white pixels within the ground truth boundary. Thus, perfect orbital tissue segmentation can be confirmed if the ground truth region is fully filled with white pixels. Our results indicated that the proposed NN provides better segmentation results than SegNet, possibly because SegNet does not exploit the multi-level features extracted during the encoding process, except for the max pooling indexes, and therefore fails to produce sophisticated orbital tissue boundaries.

Discussion
Several studies have attempted to segment and measure various orbital component areas and volumes using artificial intelligence in orbital CT images (Table 4) [13,14,[37][38][39][40][41]. However, the model best suited for semantic orbital tissue segmentation in GO patients' CT images remains unknown. Therefore, it is necessary to determine what NN characteristics are suitable for semantic segmentation. Finding a proper network for semantic segmentation is not only applicable for orbital CT images but essential for supervised learning in many fields of ophthalmology. For example, NNs have been used to segment intraretinal fluid cysts or subretinal fluid on retinal images [42]. Concepts discussed in this study may also apply to other fields of ophthalmology if developed further. NN application in ophthalmology is currently nascent but has many potential clinical uses. One study proposed a predictive tool for reliable segmentation for patient-specific orbital reconstruction in blow-out fractures. The mean Dice coefficient was 88.1% for the automated segmentation of orbital volume using CT scans compared with manual segmentation [43]. Another study attempted three-dimensional reconstruction by automatically segmenting the extraocular muscle and the optic nerve, reporting an 82.1% IoU [38]. In our study, the proposed NN performed well for the various CT planes, with at least an 87.2% Dice coefficient. These results were similar to those of comparable studies and have shown the potential for NNs to replace manual segmentation. In this study, semantic segmentation of the eyeball showed a high level of accuracy compared with other tissues in CT axial images, probably due to the high contrast of the Hounsfield units for the vitreous cavity and the surrounding fats in the CT images. Since the eyeball contour was always automatically drawn well, developing three-dimensional images of the eyeball using CT imaging will soon be possible. 
Lacrimal gland size can be affected by various lacrimal tumors and inflammatory conditions, such as GO or pseudotumors [44]. However, there have been no reports of semantic lacrimal gland segmentation using CT images. The Dice coefficient of the lacrimal gland was 87.2% with the proposed NN, which was lower than that of other tissues. This may be caused by the limitations of the CT images, as the lacrimal gland is difficult to distinguish from the eyeball and periocular tissues. Nevertheless, the resulting accuracy could be applicable in clinical practice.
We established the possibility for diverse semantic orbital tissue segmentation using a NN exhibiting a high agreement level with manual segmentation. Specifically, we devised a NN for semantic segmentation by referencing architectures and operations that help obtain sophisticated orbital tissue segments in CT images potentially deformed due to GO. Our proposed NN differs from conventional NNs in that it specializes in semantically segmenting orbital tissues in GO patients' CT images. For example, in terms of multi-level feature exploitation, Attention U-Net's highest-level features are lower-level than those of the proposed NN: Attention U-Net's minimum feature map is 1/8 of the input size, whereas that of the proposed NN is 1/16. This difference results from the number of convolution blocks, which include multiple convolution and pooling layers. Therefore, by adding high-level information to the decoder, the proposed NN handles orbital tissue size changes due to GO more effectively. Similarly, DeepLab V3+ delivers features extracted at 1/4 and 1/8 of the input size, SegNet delivers no feature information, and HarDNet-MSEG delivers features extracted at 1/4, 1/8, and 1/16 of the input size. In contrast, the proposed NN delivers features at 1, 1/2, 1/4, 1/8, and 1/16 of the input image size to the decoder through the skip connections. DeepLab V3+ and HarDNet-MSEG do not deliver low-level features to the decoder.
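The feature scales compared above can be made concrete for the 512 x 512 input used in this study. The sketch below only restates the fractions given in the text as pixel sizes; the dictionary and function names are hypothetical, and the mapping assumes that each scale is a simple fraction of the input side length.

```python
from fractions import Fraction

# Scales of the encoder features delivered to the decoder, as stated in the text.
# SegNet passes only max pooling indexes, so its list of feature scales is empty.
DELIVERED_SCALES = {
    "Proposed NN": [Fraction(1), Fraction(1, 2), Fraction(1, 4),
                    Fraction(1, 8), Fraction(1, 16)],
    "DeepLab V3+": [Fraction(1, 4), Fraction(1, 8)],
    "SegNet": [],
    "HarDNet-MSEG": [Fraction(1, 4), Fraction(1, 8), Fraction(1, 16)],
}


def delivered_sizes(scales, input_size=512):
    """Convert fractional scales into feature map side lengths in pixels."""
    return [int(input_size * scale) for scale in scales]


for name, scales in DELIVERED_SCALES.items():
    print(name, delivered_sizes(scales))
```

For the proposed NN this gives side lengths of 512, 256, 128, 64, and 32 pixels, so the decoder receives both full-resolution boundary detail and coarse location information.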
In SegNet, the loss of multi-level feature information is severe because only the pooling indexes are passed to the decoder, as shown qualitatively in Fig 4. In the proposed NN, the encoder's strong exploitation power complements the simplicity of the un-pooling technique, remedying parameter tuning difficulties and improving the segmentation performance. Consequently, the proposed NN outperformed conventional NNs and could reduce the time and effort required for complex manual segmentation. Although the two networks have similar architectures, the proposed NN (34 million parameters) is slightly larger than SegNet (29 million parameters) because of its multi-level skip connections. However, the proposed NN is still considerably smaller than DeepLab V3+ (59 million parameters).
Although we were able to investigate orbital tissue segmentation performance using various NNs in multiple CT planes, this study had some limitations. First, the manual segmentation of the images was confirmed by a single ophthalmologist and was taken as the ground truth. For more reliable semantic segmentation, multiple specialists should be consulted regarding the segmentation of orbital tissues to reduce segmentation errors. Second, since CNN-based deep learning requires considerable data, a larger dataset could further improve segmentation performance. Additionally, there was potential bias in the manual choice of the representative axial and coronal slices. For example, the Axial 2 image is the slice with the maximum lacrimal gland area; however, the selection of this slice could differ among observers. From a technical standpoint, the proposed NN's use of naive skip connections may be a limitation. Existing works have proposed various skip connection approaches, such as applying convolution within the skip connection and connecting multi-level features from multiple encoders to one decoder. Thus, future research should focus on the use of multi-level features in decoding, and the detailed design for their effective combination must be investigated more thoroughly.
In conclusion, we introduced an effective encoder-decoder NN for orbital tissue segmentation in CT images of GO patients. The proposed encoder delivers low- and high-level features to the decoder for capturing clear boundaries while resisting local noise. The decoder then exploits these features with an un-pooling operation and skip connections for effective image size restoration. The experimental results indicated that the proposed architecture significantly outperformed four conventional NN types designed for semantic segmentation. Technically, the proposed model encourages the use of multi-level features in decoding to obtain sophisticated boundaries of targets potentially deformed due to GO. This study provides a fundamental basis for automatic GO evaluation, which could replace manual CT image evaluation if developed further.